Let’s investigate our data to see what the numbers can tell us about eating fish and eating sugar and how both relate to happiness scores and incidences of anxiety & depressive disorders across many countries around the world.
Remember: not all countries were represented in each data set collected, and when refining some were lost. This means not all countries are accounted for in this investigation, though 150 countries is no small number
First we’ll import the libraries we’re going to use.
shh <- suppressPackageStartupMessages
shh(library(tidyverse))
shh(library(broom.mixed))
shh(library(lme4))
shh(library(ggeffects))
shh(library(performance))
shh(library(plotly))
You’ll notice that the data has already been cleaned and arranged in a way that makes the following operations straightforward. The original data files are available from their sources (linked on the home page), or in the github repository.
data <- tibble(read_csv('../data/happy_sad_sugar_fish.csv')) |>
arrange(country)
## Rows: 1600 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (5): year, happiness_score, fish_kg_per_person_per_year, sugar_g_per_per...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
## # A tibble: 1,600 × 6
## country year happiness_score fish_kg_per_person_p…¹ sugar_g_per_person_p…²
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 2008 37.2 0.08 23.4
## 2 Afghanis… 2009 44 0.07 23.3
## 3 Afghanis… 2010 47.6 0.07 24.2
## 4 Afghanis… 2011 38.3 0.07 24.8
## 5 Afghanis… 2012 37.8 0.07 25.4
## 6 Afghanis… 2013 35.7 0.07 24.5
## 7 Afghanis… 2014 31.3 0.19 34.7
## 8 Afghanis… 2015 39.8 0.21 32.6
## 9 Afghanis… 2016 42.2 0.23 29.9
## 10 Afghanis… 2017 26.6 0.25 32.6
## # ℹ 1,590 more rows
## # ℹ abbreviated names: ¹​fish_kg_per_person_per_year,
## # ²​sugar_g_per_person_per_day
## # ℹ 1 more variable: pct_new_per_pop_disorders <dbl>
Linear mixed effects regression models are used when not all observations are independent incorporating variables that have both fixed effects and random effects.
In our case, the fixed effects predictors are
fish_consumption, sugar_consumption, and
year. The random effects predictor is
country. The target variable is either the
happiness_score or the sadness_score.
We’ll use the lmer package to evaluate.
mixed_model <- lmer(
happiness_score ~ year + sugar_g_per_person_per_day +
fish_kg_per_person_per_year + (1 | country),
data = data)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## happiness_score ~ year + sugar_g_per_person_per_day + fish_kg_per_person_per_year +
## (1 | country)
## Data: data
##
## REML criterion at convergence: 9489.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.5305 -0.5675 0.0132 0.5632 4.7850
##
## Random effects:
## Groups Name Variance Std.Dev.
## country (Intercept) 91.97 9.590
## Residual 14.87 3.856
## Number of obs: 1600, groups: country, 150
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.048114 56.500736 -0.001
## year 0.025223 0.028171 0.895
## sugar_g_per_person_per_day 0.017764 0.004474 3.971
## fish_kg_per_person_per_year 0.113243 0.036758 3.081
##
## Correlation of Fixed Effects:
## (Intr) year s_____
## year -1.000
## sgr_g_pr___ 0.325 -0.331
## fsh_kg_p___ 0.098 -0.108 -0.012
Random effects:
Groups Name Variance Std.Dev.
country (Intercept) 91.97 9.590
Residual 14.87 3.856
Number of obs: 1600, groups: country, 150
The variance between countries (\(91.97\)) is much larger than the residual variance, suggesting that there are substantial differences in baseline happiness between countries.
this means that the difference in country accounts for much more of the variation in happiness score than the unexplained variation within countries.
the ratio \(91.97/(91.97 + 14.87) = 0.86\) suggests that around \(86%\) of the variation in happiness scores (after accounting for the fixed effects) is explained by differences in baseline happiness by country.
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.048114 56.500736 -0.001
year 0.025223 0.028171 0.895
sugar_g_per_person_per_day 0.017764 0.004474 3.971
fish_kg_per_person_per_year 0.113243 0.036758 3.081
We gain some interesting insight when looking at the fixed effects.
There is a positive association between year,
sugar_consumption, and fish_consumption.
The \(t-values\) (calculated as \(t = Estimate / Std. Error\)) will tell us what is statistically significant and what is not. As a rule of thumb, when using mixed effects models:
$t$Â > 2 suggests statistical significance at approximately the \(p = 0.05\) level.
$t$Â > 2.6 suggests significance at approximately the 0.01 level.
$t$Â > 3.3 suggests significance at approximately the 0.001 level.
Based on the \(t\) values, we
conclude that year is not statistically
significant, fish_consumption is very
statistically significant, and sugar_consumption is
highly statistically significant.
# Extract predictions while varying one predictor at a time
effects_year <- ggpredict(mixed_model, terms = "year")
effects_sugar <- ggpredict(mixed_model, terms = "sugar_g_per_person_per_day")
effects_fish <- ggpredict(mixed_model, terms = "fish_kg_per_person_per_year")
# Plot each effect separately
plot1 <- ggplot(effects_year, aes(x = x, y = predicted)) +
geom_line(color = "blue") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
labs(title = "Effect of Year on\nHappiness Score", x = "Year", y = "Predicted Happiness Score")
plot2 <- ggplot(effects_sugar, aes(x = x, y = predicted)) +
geom_line(color = "red") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "red") +
labs(title = "Effect of Sugar Consumption\non Happiness Score", x = "Sugar (g per person/day)", y = "Predicted Happiness Score")
plot3 <- ggplot(effects_fish, aes(x = x, y = predicted)) +
geom_line(color = "orange") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "green") +
labs(title = "Effect of Fish Consumption\non Happiness Score", x = "Fish (kg per person/year)", y = "Predicted Happiness Score")
# Arrange plots in a grid
library(patchwork)
plot1 + plot2 + plot3
This is a caterpillar plot or forest plot. it visually represents the estimated random intercepts for each country.
Values to the right of zero (positive values):
Countries with positive random effect estimates (points to the right of
the red dashed line) have a higher intercept than the average intercept
in the model. In simpler terms, after accounting for year,
sugar_g_per_person_per_day, and
fish_kg_per_person_per_year, these countries tend to have a
higher baseline happiness score compared to the average country in this
dataset.
Values to the left of zero (negative values): Countries with negative random effect estimates (points to the left of the red dashed line) have a lower intercept than the average intercept. They tend to have a lower baseline happiness score than the average country, even after considering the fixed effects.
Values close to zero: Countries with random effects close to zero have baseline happiness scores that are very close to the average baseline happiness score, after accounting for the fixed effects. Their happiness is already well-explained by the fixed predictors in the model, and there’s not much country-specific deviation from the average intercept needed.
# Get random effects
ranef_df <- ranef(mixed_model)$country %>%
as.data.frame() %>%
rownames_to_column("country") %>%
rename(effect = "(Intercept)") %>%
arrange(effect)
# Plot random effects with confidence intervals
ggplot(ranef_df, aes(y = reorder(country, effect), x = effect)) +
geom_point() +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(x = "Random Effect Estimate", y = "Country",
title = "Country Random Effects with 95% CI")
# Residuals vs. fitted
augmented_data <- augment(mixed_model)
ggplot(augmented_data, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
theme_minimal() +
labs(x = "Fitted values", y = "Residuals",
title = "Residuals vs Fitted Values")
## `geom_smooth()` using formula = 'y ~ x'
# QQ plot of residuals
ggplot(augmented_data, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
theme_minimal() +
labs(title = "Normal Q-Q Plot of Residuals")
# Convert ggplot to interactive plot
p <- ggplot(ranef_df, aes(y = reorder(country, effect),
x = effect,
text = paste("Country:", country,
"\nEffect:", round(effect, 2)))) +
geom_point() +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Country Random Effects with 95% CI",
x = "Random Effect Estimate", y = "Country")
ggplotly(p, tooltip = "text")